Efficient Learning in Large-Scale Combinatorial Semi-Bandits

Abstract

… $= \tilde{O}\!\left(K\sqrt{d\,n\,\min\{\ln(L),\, d\}}\right)$. (11)

We now outline the proof of Theorem 3, which is based on (Russo & Van Roy, 2013; Dani et al., 2008). Let $H_t$ denote the "history" (i.e., all the available information) at the start of episode $t$. Note that, from the Bayesian perspective, conditioned on $H_t$, $\theta^*$ and $\theta_t$ are i.i.d. draws from $N(\bar{\theta}_t, \Sigma_t)$ (see (Russo & Van Roy, 2013)). This is because, conditioned on $H_t$, the posterior belief on $\theta^*$ is $N(\bar{\theta}_t, \Sigma_t)$ and, by Algorithm 2, $\theta_t$ is sampled independently from $N(\bar{\theta}_t, \Sigma_t)$. Since ORACLE is a fixed combinatorial optimization algorithm (even though it can be independently randomized), and $E$, $\mathcal{A}$, and $\Phi$ are all fixed, conditioned on $H_t$ the sets $A^*$ and $A_t$ are also i.i.d.; furthermore, $A^*$ is conditionally independent of $\theta_t$, and $A_t$ is conditionally independent of $\theta^*$. To simplify the exposition, for all $\theta \in \mathbb{R}^d$ and all $A \subseteq E$ we define

$g(A, \theta) = \sum_{e \in A} \langle \phi_e, \theta \rangle$, (12)

so that $\mathbb{E}[f(A^*, w_t) \mid H_t, \theta^*, \theta_t, A^*, A_t] = g(A^*, \theta^*)$ and $\mathbb{E}[f(A_t, w_t) \mid H_t, \theta^*, \theta_t, A^*, A_t] = g(A_t, \theta^*)$; hence $\mathbb{E}[R_t \mid H_t] = \mathbb{E}[g(A^*, \theta^*) - g(A_t, \theta^*) \mid H_t]$. We also define the upper confidence bound (UCB) function $U_t : 2^E \to \mathbb{R}$ as $U_t(A) = \sum_{e \in A} \langle$ …
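The quantities above suggest a simple sampling step: conditioned on the history, the learner draws $\theta_t$ from the current Gaussian posterior, scores each item by $\langle \phi_e, \theta_t \rangle$, and hands those scores to ORACLE. The sketch below is an illustrative reconstruction under that reading, not the paper's Algorithm 2 verbatim; the function names, the Gaussian observation model, and the rank-one (Kalman-style) posterior update are assumptions.

```python
import numpy as np

def comblints_episode(theta_bar, Sigma, Phi, oracle, rng=None):
    """One Thompson-sampling episode in the linear-generalization setting (illustrative sketch).

    theta_bar : (d,) posterior mean of theta*
    Sigma     : (d, d) posterior covariance of theta*
    Phi       : (L, d) generalization matrix; row e is phi_e
    oracle    : callable mapping an (L,) score vector to a feasible set A_t (list of item indices)
    """
    rng = np.random.default_rng() if rng is None else rng
    # Sample theta_t ~ N(theta_bar, Sigma); given the history H_t it is independent of theta*.
    theta_t = rng.multivariate_normal(theta_bar, Sigma)
    # Score every item by <phi_e, theta_t> and let ORACLE pick the feasible set A_t.
    A_t = oracle(Phi @ theta_t)
    return theta_t, A_t

def posterior_update(theta_bar, Sigma, Phi, A_t, w_obs, sigma_noise=1.0):
    """Rank-one Bayesian linear-regression (Kalman-style) update for each observed item in A_t.
    Assumes w_e = <phi_e, theta*> + Gaussian noise with std sigma_noise (an assumption)."""
    theta_bar, Sigma = theta_bar.copy(), Sigma.copy()
    for e, w_e in zip(A_t, w_obs):
        phi = Phi[e]
        s = float(phi @ Sigma @ phi) + sigma_noise ** 2   # predictive variance of this observation
        gain = (Sigma @ phi) / s                          # Kalman gain for a scalar observation
        theta_bar = theta_bar + gain * (w_e - float(phi @ theta_bar))
        Sigma = Sigma - np.outer(gain, Sigma @ phi)       # Sigma is symmetric, so Sigma @ phi == phi @ Sigma
    return theta_bar, Sigma
```

A run would alternate `comblints_episode` with `posterior_update` on the weights observed for the chosen items, which is exactly the semi-bandit feedback the analysis above conditions on.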


Similar Resources

Efficient Learning in Large-Scale Combinatorial Semi-Bandits

• the agent knows a generalization matrix $\Phi \in \mathbb{R}^{L \times d}$ such that $\bar{w} = \mathbb{E}_P[w_t]$ is "close" to $\mathrm{span}[\Phi]$
• such models are available in many cases

Performance Metrics: At each time $t$, choosing $A_t \in \mathcal{A}$ can be challenging, since the combinatorial optimization problem $\max_{A \in \mathcal{A}} \sum_{e \in A} w(e)$ can be NP-hard. We assume the agent uses a combinatorial optimization algorithm ORACLE to choose $A_t$, where ORACLE can be an approx…
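As a concrete, hedged illustration of the ORACLE abstraction: if the feasible family $\mathcal{A}$ were simply all subsets of at most $K$ items (an assumed special case; the snippet above allows arbitrary constraints and approximate oracles), the combinatorial step reduces to a weighted top-$K$ selection.

```python
import numpy as np

def top_k_oracle(weights, K):
    """Exact ORACLE for the cardinality-K feasible family: return the K highest-scoring items.
    For richer constraint families, ORACLE may instead be an approximation algorithm."""
    weights = np.asarray(weights, dtype=float)
    # argpartition isolates the K largest entries in O(L) time without a full sort.
    top = np.argpartition(-weights, K - 1)[:K]
    return top[np.argsort(-weights[top])].tolist()

# Example: L = 6 items, pick the K = 3 items with the highest estimated weights.
print(top_k_oracle([0.2, 0.9, 0.1, 0.7, 0.4, 0.8], K=3))  # -> [1, 5, 3]
```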


Tight Regret Bounds for Stochastic Combinatorial Semi-Bandits

A stochastic combinatorial semi-bandit is an online learning problem where at each step a learning agent chooses a subset of ground items subject to constraints, and then observes stochastic weights of these items and receives their sum as a payoff. In this paper, we close the problem of computationally and sample efficient learning in stochastic combinatorial semi-bandits. In particular, we an...
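To make the interaction protocol described in this abstract concrete, here is a minimal, self-contained sketch of the stochastic combinatorial semi-bandit loop: choose a feasible subset, observe the stochastic weight of every chosen item (semi-bandit feedback), and collect their sum as payoff. The toy empirical-mean learner and Gaussian weight model are illustrative assumptions, not the paper's algorithm.

```python
import numpy as np

class EmpiricalMeanPolicy:
    """Toy learner that tracks the empirical mean weight of each item (illustrative only)."""
    def __init__(self, L):
        self.sums = np.zeros(L)
        self.counts = np.ones(L)   # start at 1 so the estimate is defined before the first pull

    def propose(self, t):
        return self.sums / self.counts   # current estimate of each item's mean weight

    def update(self, A, w_A):
        self.sums[A] += w_A
        self.counts[A] += 1

def run_semi_bandit(policy, feasible_oracle, mean_weights, n_steps, rng=None):
    """Minimal stochastic combinatorial semi-bandit loop: choose a feasible subset,
    observe the stochastic weights of the chosen items only, and earn their sum."""
    rng = np.random.default_rng() if rng is None else rng
    mean_weights = np.asarray(mean_weights, dtype=float)
    total_payoff = 0.0
    for t in range(n_steps):
        A_t = feasible_oracle(policy.propose(t))      # feasible subset chosen at step t
        w_A = rng.normal(mean_weights[A_t], 0.1)      # semi-bandit feedback for items in A_t
        total_payoff += float(w_A.sum())              # payoff is the sum of the observed weights
        policy.update(A_t, w_A)
    return total_payoff

# Example: pick the 3 best-looking of 6 items at every step for 200 steps.
payoff = run_semi_bandit(EmpiricalMeanPolicy(6),
                         lambda w: list(np.argsort(-w)[:3]),
                         [0.2, 0.9, 0.1, 0.7, 0.4, 0.8],
                         n_steps=200)
```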


Matroid Bandits: Practical Large-Scale Combinatorial Bandits

A matroid is a notion of independence that is closely related to computational efficiency in combinatorial optimization. In this work, we bring together the ideas of matroids and multiarmed bandits, and propose a new class of stochastic combinatorial bandits, matroid bandits. A key characteristic of this class is that matroid bandits can be solved both computationally and sample efficiently. We...


Importance Weighting Without Importance Weights: An Efficient Algorithm for Combinatorial Semi-Bandits

We propose a sample-efficient alternative for importance weighting for situations where one only has sample access to the probability distribution that generates the observations. Our new method, called Recurrence Weighting (RW), is described and analyzed in the context of online combinatorial optimization under semi-bandit feedback, where a learner sequentially selects its actions from a combi...


Semi-Bandits with Knapsacks

We unify two prominent lines of work on multi-armed bandits: bandits with knapsacks and combinatorial semi-bandits. The former concerns limited “resources” consumed by the algorithm, e.g., limited supply in dynamic pricing. The latter allows a huge number of actions but assumes combinatorial structure and additional feedback to make the problem tractable. We define a common generalization, supp...


Online Influence Maximization under Independent Cascade Model with Semi-Bandit Feedback

We study a stochastic online problem of learning to influence in a social network with semi-bandit feedback, individual observations of how influenced users influence others. Our problem combines challenges of partial monitoring, because the learning agent only observes the influenced portion of the network, and combinatorial bandits, because the cardinality of the feasible set is exponential i...



Journal title: not stated

Publication date: 2015